Abstract: Duplicate Bug Report Detection (DBRD) is a well-known problem in software triage systems such as Bugzilla. There are two main approaches to this problem: information retrieval and machine learning. The latter achieves better validation performance. Duplicate detection requires feature extraction, which is a time-consuming process. Both approaches suffer from runtime issues because they must compare the new bug report against all bug reports in the repository, so feature extraction and duplicate detection take a long time. This study proposes a new two-step classification approach that reduces the search space of the bug repository in the first step and then checks for duplicates using textual features in the second step. The Mozilla and Eclipse datasets are used for experimental evaluation. The results show an average validation performance of 87.70% for accuracy and 89.01% for F1-measure. Moreover, 95.85% and 87.65% of bug reports can be classified very quickly in step one for the Eclipse and Mozilla datasets, respectively; the remaining reports need textual feature extraction before they can be checked by the traditional DBRD approach. The proposed method achieves an average runtime improvement of 90%.

Keywords: Duplicate Detection, Bug Report, Machine 
Learning, Runtime Performance, Search Space Reduction 


1. Introduction 

Duplicate detection is an essential and time-consuming operation in online communities such as bug report repositories (e.g., Bugzilla) and question-and-answer forums (e.g., Stack Overflow). About 30% to 60% of the bug reports in various software repositories are duplicates, especially in open-source projects, and the number grows every day as the communities grow [1]. Duplicate detection must compare a new bug report with all bug reports in the repository. The comparison is not straightforward because bug reports contain many data fields from various domains (e.g., identity, temporal, categorical, and textual). Textual fields cannot be compared directly because two texts may have the same content but different forms and words. Feature extraction is therefore used to convert bug reports from unstructured to structured data [2]. Many feature extraction methods have been proposed, such as the time difference between temporal fields [3], textual features based on term frequencies [4], subsequence matching [5, 6], and the similarity of bug reports to specific topics as contextual features [7, 8, 9]. Feature extraction also raises pre-processing issues, especially for textual fields, e.g., stemming, removing stop words, and correcting typos [10, 11, 12, 13, 14]; handling them can improve the validation performance of duplicate bug report detection (DBRD).

After feature extraction, the features of a pair of bug reports, consisting of a new bug report and one from the repository, are checked for duplication. The Information Retrieval (IR) approach compares these features with the features of other pairs of bug reports. If two feature vectors are very similar, the pair is reported as a duplicate. The Machine Learning (ML) approach learns the features of duplicate pairs and predicts the label of a new pair, usually without comparing it with other pairs [15]. The ML approach is somewhat faster than the IR approach because, after feature extraction, it only applies the trained model to predict duplication, whereas the IR approach must also compare the feature vector with other feature vectors, which again takes a long time [16].

Duplicate detection is a binary operation on two bug reports: we cannot say that a bug report is a duplicate without considering other bug reports. This is challenging because of the massive number of bug reports in the repository. Suppose that feature extraction and duplicate detection with an ML algorithm take just 1 second per pair, even though they can take longer depending on the feature extraction methods, especially for textual features. For a repository containing 10,000 bug reports, checking one new report then takes 10,000 seconds, which is about 2.8 hours. This approach therefore cannot be used for online DBRD. Besides, some feature extractors, such as the longest common subsequence, sometimes take more than 1 second to compute.
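The estimate above can be reproduced with a few lines of arithmetic. The sketch below is only illustrative: the function name and the default of 1 second per pair are assumptions taken from the supposition in the text, not measured values.

```python
def sequential_check_seconds(repository_size: int, seconds_per_pair: float = 1.0) -> float:
    """Time to compare one new report with every stored report, one pair at a time."""
    return repository_size * seconds_per_pair


total_seconds = sequential_check_seconds(10_000)
total_hours = total_seconds / 3600  # 3600 seconds per hour

print(f"{total_seconds:.0f} s = {total_hours:.2f} h")  # prints "10000 s = 2.78 h"
```

The cost grows linearly with the repository size, which is why the per-pair cost matters so much for online use.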

Offline DBRD has no time limit. It takes a repository of bug reports and tries to find duplicates, much like a clustering problem that groups data into clusters. Here, each cluster contains bug reports that are related and duplicated. Online DBRD tries to find duplicates of a new bug report while it is being submitted to the repository, and can even help the reporter avoid submitting a duplicate in real time. The continuous query is a kind of online DBRD that repeatedly checks for duplicates [17, 18]. Time complexity is far more important in online DBRD than in the offline version. It is the major problem of online DBRD, yet it has received little attention so far.

This study focuses on runtime performance as a key objective of online DBRD, with the aim of avoiding a full comparison of a new bug report with all bug reports in the repository. The main difference between this study and related works is that this study considers the runtime challenges of online DBRD, not just the validation performance. The main contributions of this study are:

1. Introducing a novel two-step classification approach that improves the runtime performance of DBRD using a light classifier and a full classifier. The first uses features that are fast and easy to compute, and the second uses all features, including the time-consuming ones;

2. Using a voting ensemble approach to improve the validation performance of the proposed online two-step DBRD.

This study’s fundamental hypothesis is that a two-step 
filtering-based classification approach reduces the feature 
extraction runtime for online DBRD. 

Section 2 will review the machine learning approach, and 
Section 3 introduces our proposed machine learning 
algorithm. Section 4 includes the results of the experiments, 
and Section 5 concludes the study. 


2. Literature review 

The first sub-section introduces the methodology of Duplicate Bug Report Detection (DBRD). The feature extraction methods are then described so that the examples of the proposed method can be presented clearly. Finally, a comparative tabular review of related works shows the lack of runtime improvement in the state of the art.

2.1. The methodology of Duplicate Bug Report Detection (DBRD)

Figure 1 shows the traditional approach to duplicate bug report detection (DBRD). The bug reports of a dataset are pre-processed in the first step (box 2). Pre-processing includes operations such as dealing with null values, homogenizing data types, cleaning textual fields, and preparing for feature extraction. Then pairs of bug reports are selected for duplicate checking (box 4).

This methodology is for offline DBRD, but it can also be used for online DBRD. In online DBRD, there is no need to select pairs of bug reports; instead, the new bug report is paired with every bug report in the repository. Then, feature extraction produces various types of features, such as categorical, temporal, textual, and contextual features (box 6) [2]. Feature selection [19] or instance-based learning [20] can be applied at this point, after feature extraction and before training and testing. For offline use, the set of feature vectors is divided into two parts: some vectors for training a machine learning (ML) algorithm and the others for testing it (outputs of box 7). The ML algorithm is then trained on the features of duplicate pairs (box 8), and the trained model is used to predict the labels of the test set (box 10). Four outcomes are possible, depending on the predicted label and the real one, as tabulated in Table 1. The validation performance metrics of the evaluation process are calculated from Table 1. Many reliable and robust metrics exist for this purpose.

Accuracy is the share of correct predictions, for both duplicate and non-duplicate status, among all predictions (1).

\[ \text{Accuracy} = \frac{\text{True Prediction (TP)}}{\text{Total (TT)}} \tag{1} \]

Precision indicates how exact the detection of duplicates is (2).

\[ \text{Precision} = \frac{\text{True Duplicates (TD)}}{\text{Predicted Duplicates}} \tag{2} \]

Recall shows how many of the actual duplicates an ML algorithm retrieves (3).

\[ \text{Recall} = \frac{\text{TD}}{\text{Actual Duplicates} = \text{TD} + \text{FND}} \tag{3} \]

The F1-measure, or F1-score, is the harmonic mean of precision and recall (4).

\[ \text{F1-measure} = \frac{2 \times \text{Precision} \times \text{Recall}}{\text{Precision} + \text{Recall}} \tag{4} \]

[Figure 1: flowchart. Recoverable box labels: 1. Dataset of Bug Reports; 2. Preprocess (null values, convert data types, cleaning text); 6. Feature Extraction (categorical, textual, contextual); 10. Finding Duplicates.]

Figure 1. The methodology of Duplicate Bug Report Detection using Machine Learning Algorithms [15]


Table 1. Modes of the duplicate detection

Predict \ Actual         | Actual Dup (AD)             | Actual Non-Dup (AND)         | Total Actual Status
Predicted Duplicated     | True Dup (TD)               | False Dup (FD)               | AD = TD+FND
Predicted Non-Duplicated | False Non-Dup (FND)         | True Non-Dup (TND)           | AND = FD+TND
Total Prediction         | True Prediction (TP=TD+TND) | False Prediction (FP=FD+FND) | Total (TT = TP+FP = AD+AND)
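The four metrics follow directly from the four counts in Table 1. The sketch below shows one way to compute them; the class name, the function name, and the example counts are hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass


@dataclass
class Confusion:
    td: int   # True Dup: predicted duplicate, actually duplicate
    fd: int   # False Dup: predicted duplicate, actually non-duplicate
    fnd: int  # False Non-Dup: predicted non-duplicate, actually duplicate
    tnd: int  # True Non-Dup: predicted non-duplicate, actually non-duplicate


def validation_metrics(c: Confusion) -> dict:
    total = c.td + c.fd + c.fnd + c.tnd
    predicted_dup = c.td + c.fd
    actual_dup = c.td + c.fnd
    accuracy = (c.td + c.tnd) / total if total else 0.0
    precision = c.td / predicted_dup if predicted_dup else 0.0
    recall = c.td / actual_dup if actual_dup else 0.0
    denominator = precision + recall
    f1 = 2 * precision * recall / denominator if denominator else 0.0
    return {"accuracy": accuracy, "precision": precision, "recall": recall, "f1": f1}


# Hypothetical counts, not taken from the experiments.
example = validation_metrics(Confusion(td=40, fd=10, fnd=20, tnd=30))
print(example)
```

With these example counts, accuracy is 0.7, precision is 0.8, and recall is 40/60; the zero checks only guard against empty denominators.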
2.2. Feature extraction methods

The most crucial fields of bug reports are textual. Machine learning techniques cannot use them directly, so they must be converted to nominal or numerical data, a process called feature extraction. The feature extraction types in the state of the art can be categorized as follows:


1. Textual features measure the similarity of the textual fields of bug reports using natural language processing and information retrieval techniques. Counting the words shared by two bug reports requires pre-processing: tokenizing sentences and extracting words, removing useless and frequent words known as stop words, removing conjunctions and punctuation, removing redundant words, and stemming words to find the base form of each noun or verb.
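The pre-processing steps listed above can be sketched as follows. This is a deliberately small illustration: the stop-word list and the suffix-stripping rule are toy assumptions, not the stemmer or word list used in the cited works.

```python
import re

# Toy stop-word list (assumption for illustration only).
STOP_WORDS = {"a", "an", "the", "is", "are", "to", "in", "of", "and", "it", "on"}
# Suffixes tried in order by the crude stemmer below.
SUFFIXES = ("ing", "ed", "es", "s")


def crude_stem(word: str) -> str:
    """Strip one common suffix if at least three characters remain."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def preprocess(text: str) -> list:
    """Lowercase, tokenize on letters and digits, drop stop words, then stem."""
    tokens = re.findall(r"[a-z0-9]+", text.lower())
    return [crude_stem(t) for t in tokens if t not in STOP_WORDS]


def shared_terms(text_a: str, text_b: str) -> set:
    """Terms that occur in both texts after pre-processing."""
    return set(preprocess(text_a)) & set(preprocess(text_b))


print(preprocess("The shared tests are failing on Mac"))
# prints ['shar', 'test', 'fail', 'mac']
```

A crude stemmer like this can over-strip (for instance "shared" becomes "shar"); that is acceptable here only because both texts are reduced in the same way before their terms are compared.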

The N-gram model compares the sequences of n consecutive words in two text fields. A match on longer n-grams (larger n) indicates greater similarity between two documents. TF and IDF refer to the frequency of a term in a document and in a set of documents, respectively. Equations 1 and 2 are commonly used in the DBRD context, and the BM25F model is built on these equations [4]. Equation 1 counts the occurrences of a term t in document d, which can be a textual field of a bug report in a bug triage system. Parameter K is the number of textual fields in document d, and f is an index over the textual fields of a bug report. The weight factor w_f reflects the importance of each text field, length is the number of characters in term t, and average_length_f is the average length of all words in this field. Equation 2 calculates the importance of a term t over the set D of all bug reports in the software repository, which contains many documents d, each containing many terms t. The result of BM25F is an aggregated value representing the weighted average of the TF and IDF approaches over all terms common to both texts d and q, and K1 is a constant that prevents division by zero in Equation 3.


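\[ TF_D(t, d) = \sum_{f=1}^{K} \frac{w_f \times \text{occurrences}(d[f], t)}{1 - b_f + \dfrac{b_f \times \text{length}(t)}{\text{average\_length}_f}} \tag{1} \]

\[ IDF(t, D) = \log \frac{N}{\left| \{ d \in D : t \in d \} \right|} \tag{2} \]

\[ BM25F_{ext}(d, q) = \sum_{t \in d[f] \cap q[f]} IDF(t, \text{Total Text Fields of Bug Reports}) \times \frac{TF_D(t, d[f])}{K_1 + TF_D(t, d[f])} \tag{3} \]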
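The three equations can be combined into a short scoring routine. The sketch below is a simplified illustration, not the implementation of [4]: it assumes the usual BM25F reading in which the length normalization uses the token count of each field together with a per-field parameter b_f, and the function names, the field weights, and the default k1 = 2.0 are hypothetical.

```python
import math


def field_average_lengths(corpus, fields):
    """Average token count of each field over the corpus."""
    return {f: sum(len(doc.get(f, [])) for doc in corpus) / len(corpus) for f in fields}


def tf_d(term, doc, weights, b, avg_len):
    """Field-weighted, length-normalized term frequency (in the spirit of Equation 1)."""
    total = 0.0
    for f, w in weights.items():
        tokens = doc.get(f, [])
        occurrences = tokens.count(term)
        if occurrences == 0:
            continue  # an absent term contributes nothing for this field
        norm = (1.0 - b[f] + b[f] * len(tokens) / avg_len[f]) if avg_len[f] else 1.0
        total += w * occurrences / norm
    return total


def idf(term, corpus, fields):
    """log(N / number of documents that contain the term) (Equation 2)."""
    containing = sum(1 for doc in corpus if any(term in doc.get(f, []) for f in fields))
    return math.log(len(corpus) / containing) if containing else 0.0


def bm25f(d, q, corpus, weights, b, k1=2.0):
    """Sum over the terms shared by d and q (in the spirit of Equation 3)."""
    fields = list(weights)
    avg_len = field_average_lengths(corpus, fields)
    d_terms = {t for f in fields for t in d.get(f, [])}
    q_terms = {t for f in fields for t in q.get(f, [])}
    score = 0.0
    for t in d_terms & q_terms:
        tf = tf_d(t, d, weights, b, avg_len)
        score += idf(t, corpus, fields) * tf / (k1 + tf)
    return score


# Tiny hypothetical corpus: each document maps a field name to a token list.
corpus = [
    {"title": ["filter", "field", "blocks", "input"]},
    {"title": ["shared", "install", "not", "read"]},
    {"title": ["shared", "tests", "failing"]},
]
weights, b = {"title": 1.0}, {"title": 0.5}
related = bm25f(corpus[2], corpus[1], corpus, weights, b)
unrelated = bm25f(corpus[2], corpus[0], corpus, weights, b)
print(related > unrelated, unrelated)
```

Pairs with no shared terms score exactly zero, and a term that occurs in every document contributes nothing because its IDF is log(1) = 0.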


There are two major text fields in bug reports: title and description. At least one of the two is always non-empty [10]. Comparing the different combinations of these two fields adds computational overhead and makes feature extraction more time-consuming.

Sometimes, simple features can also be extracted from texts, such as the difference in text size (length in characters or number of words) [21], shown in Equation 4, where the norm |·| refers to the size of a bug report field in words and abs is the absolute value. Bug reports contain many typos [10], which adversely affect textual features and should be corrected in a pre-processing phase.

\[ SizeDiff(d, q) = \text{abs}(|d[f]| - |q[f]|) \tag{4} \]


Interconnected typos are common in software bug reports because of the identifiers of variables and methods in stack traces, and sometimes because of user typing mistakes. Some algorithms have been proposed to correct these typos [11, 12], but this phase is complicated and needs additional effort to find the best candidate among the suggested corrections based on the context of a bug report. A new labeled dataset has been introduced for typo correction in the bug report context, on which the correction algorithms reach about 80% accuracy; the effect of typo correction on DBRD remains an unresolved issue [13].

It is also possible to use other textual features, such as 
extracting the length of the longest common subsequence 
(LCS) in two texts as a textual feature and some other 
derived features (such as the number of words in LCS [5, 6]), 
or by using word embedding vectors [22]. The bag of words 
is another textual feature extraction method that considers 
different textual fields of bug reports as a bag and compares 
textual features of each bag with other bags. 
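The size difference of Equation 4 and the LCS length can be sketched as follows. This is a generic dynamic-programming LCS, not the implementation of [5, 6]; the function names are hypothetical, and the same routine works on characters (pass strings) or on words (pass token lists).

```python
def lcs_length(a, b) -> int:
    """Length of the longest common subsequence of two sequences (row-by-row DP)."""
    previous = [0] * (len(b) + 1)
    for x in a:
        current = [0]
        for j, y in enumerate(b, start=1):
            if x == y:
                current.append(previous[j - 1] + 1)
            else:
                current.append(max(previous[j], current[j - 1]))
        previous = current
    return previous[-1]


def size_diff(d_text: str, q_text: str) -> int:
    """Absolute difference of the word counts of two texts (Equation 4)."""
    return abs(len(d_text.split()) - len(q_text.split()))


title_2 = "shared install eclipse.ini not read"
title_target = "shared tests are failing on mac"
print(lcs_length(title_2.split(), title_target.split()))  # word-level LCS length
print(size_diff(title_2, title_target))
```

The table-based LCS needs time proportional to the product of the two lengths, which is consistent with the remark that this extractor can be slow for long descriptions.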

The time complexity of textual feature extraction is greater than that of other feature types. The bag of words produces many textual features, which are very time-consuming to compute. Additionally, some of the extracted features may be unnecessary, so dimension reduction is needed to select the best ones; this slows the workflow further and is not used with state-of-the-art features [23].

Word embedding has been used regularly to extract the frequency of each term with respect to nearby terms in a textual field of a bug report [22, 24]. This technique suffers from high dimensionality and sparsity because it treats all terms in the bug report repository as vectors and counts the frequency of each term in a specific bug report for nearby terms in order to convert the unstructured textual field into a structured numeric vector. This method is a type of word2vec model. It is very time- and memory-intensive and is appropriate for training neural network models, especially deep models. Neural network models are particularly suited to non-linear problems, but related works showed that DBRD is a rule-based problem that can also be solved by linear models [5, 7]. Therefore, it is better to avoid this technique unless it becomes necessary.

2. Temporal features measure the interval between two bug reports (Equations 5 and 6), in seconds or milliseconds for dates [3, 25]. Usually, when a new release of the software is published, many users report duplicate bugs, so the submission dates of duplicate bug reports are related. A smaller value of these features indicates a higher probability that two bug reports are similar. However, some researchers use a time window instead of temporal features to limit the search space of the duplicate finder and look for duplicates within a specific period [26].

\[ f_{id}(d, q) = \text{abs}(d.\text{BugId} - q.\text{BugId}) \tag{5} \]

\[ f_{Date}(d, q) = \text{abs}(d.\text{OpenDate} - q.\text{OpenDate}) \tag{6} \]


3. Structural features are calculated from runtime information [27] and stack traces [28] in bug reports. Only some bug reports include this type of information in their description; therefore, these features cannot be calculated for all bug reports. Textual similarity techniques can also be used to calculate these features; newer methods convert the stack trace to a graph and extract graph-based features such as the number of nodes, the number of incoming and outgoing edges of nodes, and similar metrics. The hidden Markov model can also be used to measure the similarity of chains of method calls in stack traces as a feature [29].


4. Categorical features show how closely two bug reports are related [4] using equality comparisons or subtraction of categorical fields. They are calculated either by checking the equality of two nominal values, as in (7), (8), and (9), or by subtracting two ordinal or interval values, as in (10), (11), (12), and (13), for two bug reports d and q [4, 25]. Equations (10) and (11) are similar, as are (12) and (13); the additive forms (10) and (12) always yield a value of at most 1. However, the subtractive forms (11) and (13) may be undefined because of division by zero, which occurs when the absolute difference equals 1; such cases can be treated as zero. Lazar et al. [25] may have written these equations by mistake, but the resulting features can still be studied. The letters “A” and “S” at the end of the feature names refer to “Addition” and “Subtraction” in their denominators, respectively.

\[ f_{Product}(d, q) = \begin{cases} 1 & \text{if } d.\text{Product} = q.\text{Product} \\ 0 & \text{otherwise} \end{cases} \tag{7} \]

\[ f_{Company}(d, q) = \begin{cases} 1 & \text{if } d.\text{Company} = q.\text{Company} \\ 0 & \text{otherwise} \end{cases} \tag{8} \]

\[ f_{Type}(d, q) = \begin{cases} 1 & \text{if } d.\text{Type} = q.\text{Type} \\ 0 & \text{otherwise} \end{cases} \tag{9} \]

\[ f_{PriorityA}(d, q) = \frac{1}{1 + |d.\text{Priority} - q.\text{Priority}|} \tag{10} \]

\[ f_{PriorityS}(d, q) = \frac{1}{1 - |d.\text{Priority} - q.\text{Priority}|} \tag{11} \]

\[ f_{VersionA}(d, q) = \frac{1}{1 + |d.\text{Version} - q.\text{Version}|} \tag{12} \]

\[ f_{VersionS}(d, q) = \frac{1}{1 - |d.\text{Version} - q.\text{Version}|} \tag{13} \]
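A minimal sketch of the categorical features follows. It assumes the denominators 1 + |difference| and 1 − |difference| suggested by the “A” and “S” suffixes; the function names and the choice of returning 0.0 for an undefined value are illustrative assumptions.

```python
def f_equal(a, b) -> int:
    """Equality feature for nominal fields, as in (7) to (9)."""
    return 1 if a == b else 0


def f_additive(a: float, b: float) -> float:
    """'A' variant, as in (10) and (12): always in the interval (0, 1]."""
    return 1.0 / (1.0 + abs(a - b))


def f_subtractive(a: float, b: float, undefined: float = 0.0) -> float:
    """'S' variant, as in (11) and (13): undefined when |a - b| equals 1."""
    denominator = 1.0 - abs(a - b)
    return undefined if denominator == 0 else 1.0 / denominator


print(f_equal("Equinox", "Equinox"), f_equal("Normal", "Major"))
print(round(f_additive(3.4, 3.5), 3), round(f_subtractive(3.4, 3.5), 3))
print(f_subtractive(3, 4))  # difference of exactly 1, falls back to 0.0
```

Note that the subtractive variant turns negative once the difference exceeds 1, so it is only comparable to the additive variant for small differences.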


5. Topical or contextual features compare the textual fields of a bug report with a word list on a specific subject, such as (1) security, (2) software performance [30], (3) the anonymous topics produced by latent Dirichlet allocation (LDA) [31], or (4) latent semantic indexing (LSI). The values of these semi-textual features indicate how strongly the report relates to specific contexts and thus give a conceptual category for bug reports. The contextual features of two bug reports can be compared as vectors using cosine similarity, or individually using the Manhattan distance, to expand the feature space of bug reports [7].
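Both comparisons are simple to compute once each bug report has a vector of contextual scores. The sketch below uses plain Python lists; the function names and the example vectors are hypothetical.

```python
import math


def cosine_similarity(u, v) -> float:
    """Cosine of the angle between two equally long vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def manhattan_distance(u, v) -> float:
    """Sum of absolute coordinate differences."""
    return sum(abs(a - b) for a, b in zip(u, v))


# Hypothetical context scores for two reports (e.g., two topics each).
print(cosine_similarity([1.0, 0.0], [0.0, 1.0]))   # orthogonal vectors
print(manhattan_distance([1.0, 2.0], [4.0, 0.0]))  # |1-4| + |2-0|
```

Applying the Manhattan distance to one context at a time yields one feature per context, while the cosine similarity summarizes all contexts in a single feature.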


2.3. Machine learning algorithms 

As Table 2 shows, considerable effort has been made over the past decade to detect duplicate bug reports using the features described above. The numbered columns refer to textual, identical, temporal, structural, categorical, and contextual features. Some features (textual, temporal, and categorical) are essential, and a duplicate detection process should include them, while contextual features are less important [7]. Textual features can also cover structural features, even though structural features represent another aspect of similarity between bug reports. Further, structural features can be calculated only for bug reports that contain stack traces.

Contextual features need contextual attributes. Because these are derived from textual fields, they can be calculated once and stored during the text-cleaning step of the pre-processing phase to reduce feature extraction runtime.

Most state-of-the-art approaches use ML algorithms. Almost all related works have focused on improving the validation performance using new feature extraction methods [5, 7, 8, 32, 33] and/or various ML algorithms such as deep learning [24, 34, 35, 36]. Table 2 gives a brief review of related and state-of-the-art works that use ML algorithms. As Table 2 shows, none of these works addressed the search space and runtime challenges of duplicate checking, except one continuous query study that tried to improve validation performance for continuous queries as an online challenge [17]. Related works usually evaluate their methods on a randomly chosen subset of bug report pairs without considering the runtime challenge, although comparing a new bug report with the entire database would be very time-consuming. As there is no related work on the DBRD runtime problem, the literature review is limited to the general parameters and features of state-of-the-art studies.

The literature review shows that this study is the first to consider the runtime challenge of DBRD. Therefore, the best current methods are chosen for comparison. The review also shows that the parameters selected for our experiments are almost the same as those used in the state of the art.
Table 2. Review of related works in state-of-the-art using machine learning algorithms

No. | Ref | Year | Machine Learning Algorithms | Dataset | Validation Metrics
1 | Bettenburg et al. [38] | 2008 | SVM, Naive Bayes | Eclipse | Accuracy
2 | Sun et al. [39] | 2010 | SVM | Eclipse, Mozilla, OpenOffice | Recall
3 | Nguyen et al. [31] | 2012 | LDA, Ensemble Averaging | Eclipse, Mozilla, OpenOffice | Accuracy
4 | Tian et al. [40] | 2012 | SVM | Mozilla | F1-measure, TP and TN Rates
5 | Liu et al. [41] | 2013 | SVM | Eclipse, Mozilla | F1-measure, MAP
6 | Alipour et al. [30], [42] | 2013 | 0-R, C4.5, KNN, Logistic Regression, Naive Bayes | [illegible] | Accuracy, Kappa, ROC, AUC
7 | Feng et al. [43] | 2013 | SVM, Naive Bayes, Decision Tree | MeeGo | Accuracy, Precision, Recall, MAP, TP and TN Rates
8 | Lazar et al. [25] | [illegible] | kNN, Linear SVM, RBF SVM, Naive Bayes, Decision Tree, Random Forest | Eclipse, Mozilla, OpenOffice, NetBeans | Accuracy, Precision, Recall, AUC
9 | Tsuruda et al. [44] | 2015 | SVM | Eclipse, OpenOffice | [illegible]
10 | Aggarwal et al. [8], [9] | 2015, 2017 | 0-R, Naive Bayes, Logistic Regression, SVM, C4.5 | Eclipse, Mozilla, OpenOffice | Accuracy, Kappa
11 | Sharma and Sharma [45] | 2015 | SVM | Bugzilla | ROC, TP and FP Rates
12 | [illegible] | 2016 | 0-R, C4.5, KNN, Logistic Regression, Naive Bayes | Android, Eclipse, Mozilla, OpenOffice | Accuracy, Kappa, ROC, AUC, MAP
13 | Lin et al. [23] | 2016 | SVM | [illegible] | Recall
14 | Pasala et al. [47] | 2016 | kNN | Chrome | Recall
15 | [illegible] | [illegible] | Random Forest | Eclipse, Mozilla, Bugzilla, SeaMonkey | Precision, Recall, F1-measure, AUC
16 | Deshmukh et al. [34] | 2017 | Siamese Convolutional Neural Networks (CNN), Long Short-Term Memory (LSTM) | Eclipse, OpenOffice, NetBeans | Accuracy, Recall
17 | Budhiraja et al. [24] | 2018 | Deep Neural Network | Mozilla, OpenOffice | Recall
18 | Su and Joshi [49] | 2018 | Logistic Regression | Oracle | Recall
19 | Xie et al. [36] | 2018 | Convolutional Neural Networks | [partly illegible], MapReduce, Spark | Accuracy, F1-measure
20 | Soleimani Neysiani and Babamir [5] | 2019 | Naive Bayes, Decision Tree, Linear Regression, Perceptron Neural Network, Bayesian Boosting by Decision Tree | Android, Eclipse, Mozilla, OpenOffice | Accuracy, Precision, Recall
21 | Soleimani Neysiani and Babamir [7] | 2019 | Naive Bayes, Decision Tree, Linear Regression, Auto MLP, Bagging ensemble of Decision Tree | Android, Eclipse, Mozilla, OpenOffice | Accuracy, Precision, Recall
22 | Soleimani Neysiani and Babamir [14] | [illegible] | Naive Bayes, k-Nearest Neighborhood, Decision Tree, Linear Regression, Auto Multi-Layer Perceptron, Deep Learning with H2O | Android | Accuracy, Precision, Recall
23 | Soleimani Neysiani and Babamir [16] | [illegible] | k-Nearest Neighborhood, Linear Regression | [illegible] | Accuracy, Precision, Recall, F1 Measure
24 | Soleimani Neysiani et al. [50] | 2020 | Linear Regression, Decision Tree, Auto Multi-Layer Perceptron, Deep Learning with H2O | Android, Eclipse, Mozilla, OpenOffice | Accuracy, Precision, Recall, F1 Measure
25 | Soleimani Neysiani et al. [20] | [illegible] | Linear Regression, Decision Tree, k-Nearest Neighborhood | Android, Mozilla | Accuracy, Precision, Recall
26 | Kukkar et al. [51] | 2020 | Deep Learning (CNN) | Eclipse, Mozilla, OpenOffice, Gnome, NetBeans, Firefox | Accuracy, Precision, Recall, F1 Measure, Recall@k
27 | [illegible] | [illegible] | Naive Bayes, Random Forest, CNN, LSTM, CNN+LSTM | Eclipse, Mozilla, Apache, KDE | Accuracy, Precision, Recall, F1 Measure
28 | Zhang et al. [53] | 2022 | Deep Learning (Dual Channel-CNN) | Eclipse, Mozilla, Hadoop, Spark, Kibana, VS Code | Recall@k
3. Proposed method

Calculating the textual features is very time-consuming. The 
main idea of the proposed approach is dividing the duplicate 
detection process into two phases: 1) trying to predict the 
duplication status using light features like non-textual 
features, which can be calculated quickly; 2) predicting the 
duplication status using all features, including the textual 
features that are more time-consuming. Figure 2 shows the 
methodology of the proposed approach. 

Steps 1 to 5, the pre-processing phase (red box), are the same as in the traditional methodology of duplicate bug report detection (DBRD), but here the data is split for evaluation in Step 6 (box 6). The pairs of bug reports must be divided into two parts to train and test the machine learning (ML) algorithms. It is better to split the samples so that both parts have the same percentage of duplicate pairs. After the split, the training phase (green box) starts: features are extracted from the training pairs (box 6.1.2), but without textual features. Then an ML algorithm is trained on the non-textual features of the pairs of bug reports (box 6.1.4).

In addition, the textual features of the training pairs are extracted (box 6.1.6) and appended to the non-textual features (box 6.1.8) so that another ML algorithm can be trained on all kinds of features, both textual and non-textual (box 6.1.10). We now have two ML algorithms as duplicate item finders (DIF).

The test phase (blue box) starts after the training phase is finished. The non-textual features of the test pairs are extracted as well (boxes 6.2.1, 6.2.2, and 6.2.3), and the first, light DIF evaluates them (box 7). The result of an ML algorithm usually contains two values: 1) the predicted label, which here is the status of a pair of bug reports as duplicate or non-duplicate; 2) the confidence of the ML algorithm in this prediction. Every ML algorithm calculates its confidence in its own way. For example, the Naive Bayes confidence is the probability that the algorithm calculates directly, which is a real value. The k-NN confidence is the number of the k neighbors with the predicted class divided by k; in weighted predictions, each neighbor's vote is weighted by distance. For the SVM, the confidence of the positive class in a binomial problem can be reasonably estimated by (14), where function_value is the SVM prediction. This approach is also used by the RapidMiner tool [54].


\[ \text{Confidence} = \frac{1}{1 + e^{-\text{function\_value}}} \tag{14} \]
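Equation 14 is the logistic function applied to the SVM output. The sketch below evaluates it in a numerically stable way; the function name is hypothetical, and the two-branch form is an implementation choice that is not described in the text.

```python
import math


def svm_confidence(function_value: float) -> float:
    """Logistic mapping of an SVM decision value to a confidence in (0, 1)."""
    if function_value >= 0:
        return 1.0 / (1.0 + math.exp(-function_value))
    # For negative inputs, use the equivalent form e^x / (1 + e^x),
    # which avoids overflow in exp() for large magnitudes.
    exp_value = math.exp(function_value)
    return exp_value / (1.0 + exp_value)


print(svm_confidence(0.0))  # prints 0.5, a pair exactly on the decision boundary
print(svm_confidence(3.0) > 0.9, svm_confidence(-3.0) < 0.1)  # prints True True
```

Decision values far from zero map to confidences near 0 or 1, which is what makes a fixed confidence threshold usable with this estimate.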
Figure 2. The methodology of Duplicate Bug Report Detection using the proposed two-step Classification Approach of Machine Learning Algorithms
Next, the confidence of the predicted status is checked (box 9). If the confidence is above a specific threshold, e.g., 90%, the prediction is accepted. Otherwise, the textual features of that pair are extracted as well (box 10), and the combined features (boxes 12 and 13), including the textual and non-textual features of that test pair, are passed to the second DIF (box 6.1.11), which considers all kinds of features to predict the status of the pair (box 14). The predicted status (box 15) is then compared with the real status of the pair (box 16) to evaluate the validation performance metrics of the DIF as the final result of DBRD (box 17).
For example, consider the three real bug reports from the Eclipse dataset [4, 46, 55] in Table 3. The first two already exist in the bug report database, and the third is a new, or target, bug report that is compared with the existing ones. The identical fields are the bug report ID and the master ID, which is the ID of the main bug report for a duplicate bug report and is null for bug reports that are not duplicates. The categorical fields give detailed categories for each bug report, such as the software product and component. The status field shows the latest state of the bug report, which can be new, assigned to a developer for fixing, fixed, duplicate, and so on. The textual fields are the main fields of every bug report because they are the primary evidence for deciding whether a bug report is unique or a duplicate.

The contextual fields are derived from the textual fields and, as mentioned before, can be calculated and stored once for later use in feature extraction when the report is compared with other bug reports. More contextual fields can be defined in various domains, based on the modules of the software triage system or on external aspects such as software engineering topics. The selected example uses four contexts (general, networking, cryptography, and Java) to calculate contextual fields, but it is also possible to build dictionaries for each module of the software triage system and calculate contextual fields for those topics.

Table 4 shows some of the features extracted when comparing the new bug report with the existing bug reports in the database. The class label shows whether the selected pair is really a duplicate, based on the master ID field in the dataset. In the training dataset, the master ID fields are filled in, so the training pairs have deterministic labels. In the test phase, the label must be predicted using machine learning algorithms. Various types of state-of-the-art features are calculated in Table 4; their names and equations are referenced in the second column, and their values are shown in the last two columns.


Table 3. Real Sample Bug Reports Data

Field Type | Field | Bug Report 1 | Bug Report 2 | Target Bug Report
Identical | Bug ID | 240427 | 258365 | 258935
Identical | Master ID | 233269 | - | 258365
Categorical | Product | Equinox | Equinox | Equinox
Categorical | Component | P2 | P2 | P2
Categorical | Type | Normal | Major | Normal
Categorical | Priority | 3 | 3 | 3
Categorical | Version | 3.4 | 3.5 | 3.5
Categorical | Status | Duplicate | Fixed | Duplicate
Temporal | Open Date (GMT) | 11/7/2008 00:25:00 | 10/12/2008 21:33:00 | 16/12/2008 14:18:00
Temporal | Close Date (GMT) | 14/7/2008 15:27:48 | 21/1/2010 07:12:35 | 18/12/2008 03:52:48
Textual | Title | software update dialog / filter field blocks user input | [fwkadmin][shared] shared install eclipse.ini not read | [shared] shared tests are failing on mac
Textual | Description | menu: help->software updates displays a dialog with available software to install. in the top part there is a filter [illegible] the filter and whole dialog is block [illegible] what blocks possibility to continue typing into the filter field. there is certain delay but it is too short to be able to type in you filter expression. it almost imposible to use the filter field resonably. | i20081210-0800 when i install something in shared install i loose my p2 menus. | [build ID illegible] testreadonlydropinsstartup and testuserdropinsstartup are failing [remainder illegible]
Contextual (derived from textual) | General | 26.858 | 8.973 | 4.298
Contextual (derived from textual) | Networking | 23.838 | 6.821 | 4.298
Contextual (derived from textual) | Cryptography | 21.514 | 5.501 | 2.325
Contextual (derived from textual) | Java | 22.946 | 3.291 | 4.298
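Table 4. Real sample extracted features

| Features Type | Feature (equation) | Target Bug Report vs Bug Report 1 | Target Bug Report vs Bug Report 2 |
|---|---|---|---|
| Label | Duplicate | No | Yes |
| Textual | fBM25F-1G (3) | 1.407 | 1.786 |
| Textual | fBM25F-2G (3) | 1.270 | 2.395 |
| Textual | fSizeDiff (4) | 386 | 17 |
| Textual | fLCS [5, 6] | 83 | 59 |
| Temporal | fID (5) | 18508 | 570 |
| Temporal | fDate (6) (sec) | 13387812 | 34620875 |
| Categorical | fProduct (7) | 1 | 1 |
| Categorical | fCompany (8) | 1 | 1 |
| Categorical | fType (9) | 1 | 0 |
| Categorical | fPriority_a (10) | 1 | 1 |
| Categorical | fPriority_s (11) | 1 | 1 |
| Categorical | fVersion_a (12) | 0.9 | 1 |
| Categorical | fVersion_s (13) | 1.1 | 1 |
| Contextual (Distance) | Cosine | 0.984 | 0.937 |
| Contextual (Distance) | General Manhattan | 22.560 | 4.674 |
| Contextual (Distance) | Networking Manhattan | 19.539 | 2.522 |
| Contextual (Distance) | Cryptography Manhattan | 19.188 | 3.175 |
| Contextual (Distance) | Java Manhattan | 18.647 | 1.006 |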
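
The contextual distance rows of Table 4 can be checked against the contextual scores of Table 3. The following is a minimal sketch, assuming that each Manhattan feature is the absolute difference of the two reports' scores for one category and that the cosine feature is taken over the four-category score vector; the function and variable names are illustrative and not taken from the original implementation.

```python
import math

# Contextual scores per category, copied from Table 3.
CATEGORIES = ("General", "Networking", "Cryptography", "Java")
bug_report_1 = (26.858, 23.838, 21.514, 22.946)
bug_report_2 = (8.973, 6.821, 5.501, 3.291)
target = (4.298, 4.298, 2.325, 4.298)


def manhattan_per_category(a, b):
    """Absolute difference of the two reports' scores for each category."""
    return {name: abs(x - y) for name, x, y in zip(CATEGORIES, a, b)}


def cosine_similarity(a, b):
    """Cosine of the angle between two contextual score vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


for name, other in (("Bug Report 1", bug_report_1), ("Bug Report 2", bug_report_2)):
    distances = manhattan_per_category(target, other)
    print(name,
          {k: round(v, 3) for k, v in distances.items()},
          "cosine:", round(cosine_similarity(target, other), 3))
```

Under these assumptions the computed values agree with Table 4 to within rounding in the last digit, which suggests the table's two value columns compare the target bug report with Bug Report 1 and Bug Report 2, respectively.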


It should be mentioned that textual features are very important for DBRD, but they take longer to compute than the other feature types, and their cost depends on the length of the texts. For example, the minimum, average, and maximum text lengths in the Eclipse dataset are 8, 1,080, and 65,054 characters, or 2, 136, and 10,762 words, respectively. A pretest comparing bug report 259801 (697 characters, 130 words) with more than 18,000 other bug reports shows that the minimum, average, and maximum runtimes for calculating all non-textual features were 0, 2.8, and 100.8 microseconds per pair. The corresponding times for textual features were 0.6, 11.3, and 968.1 milliseconds, which is about four thousand times longer on average. If the selected bug report is longer, the runtime increases further. So, textual features are harmful for runtime performance but useful for the validation performance of DBRD.
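
The gap between the two feature groups can be made concrete with a short calculation. This sketch uses the average per-pair times reported in the pretest and assumes a round figure of 18,000 comparisons for the "more than 18,000" bug reports; the variable names are illustrative.

```python
# Average per-pair extraction times from the pretest on bug report 259801.
NON_TEXTUAL_AVG_S = 2.8e-6   # 2.8 microseconds
TEXTUAL_AVG_S = 11.3e-3      # 11.3 milliseconds
PAIRS = 18_000               # assumed round figure for "more than 18,000"

ratio = TEXTUAL_AVG_S / NON_TEXTUAL_AVG_S
non_textual_scan_s = PAIRS * NON_TEXTUAL_AVG_S   # whole repository, non-textual only
textual_scan_s = PAIRS * TEXTUAL_AVG_S           # whole repository, textual only

print(f"textual vs non-textual cost per pair: about {ratio:,.0f}x")
print(f"non-textual scan: {non_textual_scan_s:.3f} s")
print(f"textual scan: {textual_scan_s:.1f} s")
```

At these averages, checking one new report against the repository takes a fraction of a second with non-textual features and a few minutes with textual features, which is the cost the two-step approach tries to avoid for most pairs.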

After feature extraction, feature vectors consisting of numerical values and a label are built and given to an ML algorithm, which is trained to learn the features of duplicate pairs. In the test phase, a feature vector is built for each comparison of the new bug report with an existing bug report in the triage system, and each feature vector is given to the trained ML algorithm to predict its label. In the proposed approach, only the non-textual features are calculated in the first stage, and they are given to a simple (light, non-textual) ML algorithm. Alongside the label predicted by the ML algorithm, its confidence is checked. If the confidence is more than a certain threshold (e.g., 90%), the label predicted from the non-textual features is accepted and reported as the final result. Otherwise, the textual features are calculated and appended to the feature vector, the new full feature vector is given to the heavy (full) ML algorithm, and its result is reported as the final result.
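
The control flow described above can be sketched as follows. This is an illustration under stated assumptions, not the authors' RapidMiner implementation: the two classifiers are hypothetical stand-ins that return a label and a confidence, and the feature names (`same_product`, `text_similarity`) are invented for the example.

```python
from typing import Callable, Dict, Tuple

Features = Dict[str, float]
# A DIF maps a feature vector to (is_duplicate, confidence in [0, 1]).
DIF = Callable[[Features], Tuple[bool, float]]


def two_step_classify(
    non_textual: Features,
    extract_textual: Callable[[], Features],
    light_dif: DIF,
    full_dif: DIF,
    threshold: float = 0.90,
) -> Tuple[bool, str]:
    """Classify one pair; return the label and the step that decided it."""
    label, confidence = light_dif(non_textual)
    if confidence > threshold:
        return label, "step 1"
    # Low confidence: only now pay for textual feature extraction.
    full_vector = {**non_textual, **extract_textual()}
    label, _ = full_dif(full_vector)
    return label, "step 2"


# Hypothetical stand-ins for trained models, for illustration only.
def toy_light_dif(f: Features) -> Tuple[bool, float]:
    if f["same_product"] == 0:
        return False, 0.97   # different product: confidently not a duplicate
    return True, 0.60        # same product: a weak hint, not enough on its own


def toy_full_dif(f: Features) -> Tuple[bool, float]:
    return f["text_similarity"] > 0.5, 0.95


textual_calls = []


def toy_extract_textual() -> Features:
    textual_calls.append(1)  # count how often the costly step runs
    return {"text_similarity": 0.8}


print(two_step_classify({"same_product": 0}, toy_extract_textual, toy_light_dif, toy_full_dif))
print(two_step_classify({"same_product": 1}, toy_extract_textual, toy_light_dif, toy_full_dif))
print("textual extractions:", len(textual_calls))
```

Passing the textual extraction as a callable keeps it lazy, so its cost is incurred only for pairs that the first classifier cannot settle with enough confidence.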

Furthermore, the DBRD process is divided into two parts 
based on the textual features. There are some considerations 
to improve the validation performance of this two-step 
classification approach as a DBRD: 

1. Using more robust non-textual features to improve the validation performance of the non-textual DIF, e.g., using more topics for contextual features [7];

2. Using robust and powerful ML algorithms for the first DIF, e.g., ensemble algorithms that combine several ML algorithms and vote on their results, to improve the validation performance of the non-textual DIF;

3. Using an ML algorithm like linear regression to predict the best value for the threshold of the confidence checking step (box 9).
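
The voting idea in the second consideration can be sketched as a small ensemble that also yields the confidence needed by the threshold check. The source does not specify the voting rule, so averaging the base models' probabilities (soft voting) is an assumption here, and the three base models are hypothetical placeholders for trained classifiers.

```python
from statistics import mean
from typing import Callable, Dict, Sequence, Tuple

Features = Dict[str, float]
Model = Callable[[Features], float]  # returns an estimated P(duplicate)


def soft_vote(models: Sequence[Model], features: Features) -> Tuple[bool, float]:
    """Average the models' probabilities; report the label and its confidence."""
    p_duplicate = mean(model(features) for model in models)
    is_duplicate = p_duplicate >= 0.5
    confidence = p_duplicate if is_duplicate else 1.0 - p_duplicate
    return is_duplicate, confidence


# Hypothetical base models standing in for trained classifiers.
toy_models = [
    lambda f: 0.9 if f["same_product"] else 0.1,
    lambda f: 0.8 if f["same_component"] else 0.2,
    lambda f: 0.4,
]

label, confidence = soft_vote(toy_models, {"same_product": 1, "same_component": 1})
print(label, round(confidence, 2))
```

With a rule of this kind, disagreement among the base models lowers the reported confidence, so contested pairs are more likely to fall below the threshold and be passed to the full DIF.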


4. Experimental results 

The traditional and proposed methodologies of duplicate bug report detection (DBRD) are implemented using the Takelab script [56] in Python 3.8 for textual feature extraction and RapidMiner 9.5 [57] for the machine learning algorithms. The state-of-the-art approach is the most commonly used ML-based DBRD [5, 7, 8, 14, 20, 21, 30, 46, 50]. In the first experiment, we use all the ML algorithms of [50], which reports the best results among related works, for comparison. The ID difference feature was the most important feature and improved the validation performance considerably, but it was eliminated because assuming a relation between ID difference and duplication status may bias the judgment. The open date difference, however, was kept as a temporal feature. The results are therefore more realistic, and the proposed approach should be closer to what triagers need in the real world. Figure 4 shows the comparison of the ML algorithms for the different scenarios. As expected, the results of the simple-feature scenario (S) are worse than those of the full-feature scenario (F); interestingly, the results of the two-step classification scenario (TSC) lie between the other two, at more than 87% accuracy and 89% F1-measure. The ML algorithm of TSC is a voting-based ensemble of the other ML algorithms, that is, S-Vote for the non-textual duplicate item finder (DIF) and F-Vote for the full DIF. Even though the deep learning algorithm performs better in both the simple and full feature scenarios, TSC was tested only with the voting algorithm, because deep learning training is time-consuming and its improvement is less than one percent in both scenarios.

Table 5 shows the parameters of the experiments. Various machine learning (ML) algorithms are compared for DBRD in three scenarios: 1) the simple or light scenario, which includes only the non-textual features, as an older state-of-the-art approach; 2) the full-feature scenario, as the current state-of-the-art approach; and 3) two-step classification (TSC), as the proposed approach.

Moreover, the detailed properties of the datasets are tabulated in the control variable section of Table 5, which indicates the number of bug reports in each dataset and the number of bug report pairs selected in step four of both the state-of-the-art and the proposed methodologies. K-fold cross-validation is used to evaluate the ML algorithms and avoid biased results. Moreover, various kinds of features are extracted for duplicate detection.

Table 5. The parameters of experiments


| Variable Type | Variable Name | Variable States (Values) |
|---|---|---|
| Independent | Classifier | Linear Regression (LR), Decision Tree (DT), Random Forest (RF), Deep Learning with H2O (DL) [58, 59], Voting of all mentioned classifiers as an ensemble approach (Vote) |
| Independent | Scenarios | Simple (S): using the non-textual features; Full Features (F): as the traditional approach; Two-Step Classification approach (TSC): the proposed approach |
| Independent | Dataset | Eclipse and Mozilla [4, 46, 55] |
| Control | Number of Bug Reports | Eclipse: 45,234; Mozilla: 75,648 |
| Control | Number of Bug Report Pairs | Eclipse: 15,219 non-duplicates, 5,536 duplicates, 20,755 total (26.6% duplicates); Mozilla: 40,537 non-duplicates, 14,297 duplicates, 54,834 total (26.0% duplicates); Total: 55,756 non-duplicates, 19,833 duplicates, 75,589 total (26.3% duplicates) |
| Control | K-fold | 10 |
| Control | Stemming | Is used |
| Control | Features Collection | Temporal, Categorical, Contextual [7, 46], Textual [56] |


[Figure 3: bar chart of runtime in seconds on a logarithmic scale. Eclipse: Simple 219.2, TSC 1,698.9, Full 47,146.2. Mozilla: Simple 579.0, TSC 15,141.3, Full 124,549.6.]

Figure 3. The runtime of three scenarios for both datasets based on seconds in logarithmic scale


Table 6. The maximum performance of different machine learning algorithms for various kinds of scenarios of classification

| Dataset | State-of-the-art Classification, Accuracy | State-of-the-art Classification, F1-measure | Two-step Classification, Accuracy | Two-step Classification, F1-measure | Full Features [50], Accuracy | Full Features [50], F1-measure |
|---|---|---|---|---|---|---|
| Eclipse | 87.33% | 77.67% | 88.05% | 86.43% | 91.53% | 84.75% |
| Mozilla | 84.26% | 89.34% | 87.34% | 91.60% | 90.98% | 93.80% |
| Average | 85.80% | 83.51% | 87.70% | 89.01% | 91.26% | 89.28% |

Table 6 shows the maximum performance obtained in the experiments for each dataset. The TSC validation performance lies almost in the middle of the other two scenarios for both datasets, even though TSC is implemented only with the Vote-based ML algorithm while the other scenarios are represented by their best ML algorithms.

Since the results of TSC are not expected to be lower than those of the simple scenario, the remaining question is the impact of TSC on runtime performance. The number of pairs of bug reports is used for the runtime comparison instead of execution time, to eliminate the impact of hardware configuration on the results and to give better insight into the time complexity improvement. Using a logarithmic scale to better show the contrast between values, Figure 5 shows the number of pairs classified by the first and second DIFs. The first DIF uses simple features for classification, and the second DIF uses full features, as listed in Table 5. The results show that many bug report pairs can be classified by the first classifier, and just a few pairs need the complex textual feature extraction phase. Non-textual features can be extracted in less than a millisecond, but textual features sometimes take more than 5 seconds to extract for a single pair, depending on the text lengths.


[Figure 4: bar chart of accuracy and F1-measure (%) per ML algorithm for the Simple, Two-Step Classification, and Full scenarios; vertical axis from 81.00 to 93.00.]

Figure 4. The average validation performance of various scenarios of Table 5


[Figure 5: bar chart of the number of bug report pairs on a logarithmic scale. Eclipse: Simple Features 15,914, Full Features 690, Total Pairs 16,604. Mozilla: Simple Features 38,452, Full Features 5,416, Total Pairs 43,868.]

Figure 5. The number of bug reports which can be detected fast using simple features versus full features scenario (including textual features)


Table 7. Percentage of bug reports predicted for classification

| Dataset | Using Non-Textual Features | Using All Features | Total Number of Bug Reports (100%) |
|---|---|---|---|
| Eclipse | 95.85% | 4.15% | 16,604 |
| Mozilla | 87.65% | 12.35% | 43,868 |
| Average of Results | 89.90% | 10.10% | |
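
The shares in Table 7 follow from the pair counts of Figure 5 (Eclipse: 15,914 pairs settled with non-textual features and 690 with full features; Mozilla: 38,452 and 5,416). A minimal sketch of the conversion, assuming the "Average of Results" row is pooled over both datasets rather than a mean of the two percentages:

```python
# Pair counts from Figure 5: pairs settled by the first DIF (non-textual
# features) and pairs forwarded to the second DIF (full features).
counts = {
    "Eclipse": {"simple": 15_914, "full": 690},
    "Mozilla": {"simple": 38_452, "full": 5_416},
}


def shares(simple, full):
    """Return (% settled in step one, % needing full features, total pairs)."""
    total = simple + full
    return 100 * simple / total, 100 * full / total, total


for name, c in counts.items():
    simple_pct, full_pct, total = shares(c["simple"], c["full"])
    print(f"{name}: {simple_pct:.2f}% non-textual, {full_pct:.2f}% full, {total:,} pairs")

pooled_simple = sum(c["simple"] for c in counts.values())
pooled_full = sum(c["full"] for c in counts.values())
pooled_simple_pct, pooled_full_pct, pooled_total = shares(pooled_simple, pooled_full)
print(f"Pooled: {pooled_simple_pct:.2f}% non-textual, {pooled_full_pct:.2f}% full")
```

The pooled figure weights Mozilla more heavily because it contributes more pairs, which is why it sits closer to the Mozilla share than a simple mean of the two rows would.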
The values are converted to percentages in Table 7. It shows that 95.85% and 87.65% of pairs of bug reports can be classified faster than with the traditional approach, using only non-textual features, for the Eclipse and Mozilla datasets, respectively. The average validation performance of these predictions was 87% and 89% for accuracy and F1-measure, respectively, which is higher than that of many related works [48].

Furthermore, on average 89.9% of pairs of bug reports in both datasets can be classified with more than 87% accuracy and F1-measure by simply checking the categorical, temporal, and contextual features. The contextual features can be calculated once, when a new bug report is first inserted into the repository, to improve the performance of DBRD. So, DBRD can be implemented merely with a SQL query on the repository for almost all bug reports, and those that are suspicious and need more checking can be sent for textual feature extraction, with their full feature vectors given to the full DIF.


5. Conclusion 

This study focused on the runtime performance of duplicate bug report detection (DBRD). A novel two-step classification method was proposed for DBRD, which uses non-textual features in the first step to check whether a pair of bug reports is duplicated. A machine learning (ML) algorithm is trained as a duplicate item finder (DIF) to predict the duplication status from the non-textual feature vectors of pairs of bug reports. If the first DIF has low confidence in its prediction, the textual features are extracted, and the second DIF predicts the status of the pair based on all features, especially textual features. The experiments show that the validation performance of the proposed approach is better than that of the first, non-textual DIF alone, and that its runtime performance is better than that of the second DIF alone. So, the proposed approach offers a good balance of runtime and validation performance in comparison with the traditional approaches. In future work, additional non-textual features, such as more contextual features, can improve the validation performance of the first DIF. The threshold at which the first DIF switches to the second DIF can also be tuned. Other datasets can be used to evaluate the validation performance of the approach. A semi-supervised [60] machine learning algorithm can be used for the incremental bug report repository of software triage systems.